Empirical Q-Value Iteration
Authors
Abstract
We propose a new, simple, and natural algorithm for learning the optimal Q-value function of a discounted-cost Markov Decision Process (MDP) when the transition kernels are unknown. Unlike classical learning algorithms for MDPs, such as Q-learning and ‘actor-critic’ algorithms, this algorithm does not rely on stochastic approximation. We show that our algorithm, which we call the empirical Q-value iteration (EQVI) algorithm, converges almost surely to the optimal Q-value function. To the best of our knowledge, this is the first algorithm for learning in MDPs that guarantees almost sure convergence without using stochastic approximation. We also give a rate of convergence (a non-asymptotic sample complexity bound) and show that an asynchronous (or online) version of the algorithm also works. Preliminary experimental results suggest that our algorithm converges to a ballpark estimate faster than stochastic approximation-based algorithms; in the asynchronous setting, EQVI vastly outperforms the popular and widely used Q-learning algorithm.
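To make the idea concrete, below is a minimal Python sketch of one EQVI-style sweep as described in the abstract: the expectation over next states in the Bellman backup is replaced by an empirical average over simulated samples. The helper `sample_next_state`, the `cost` function, and the sample count `n_samples` are illustrative assumptions, not part of the paper.

```python
import numpy as np

def eqvi_sweep(Q, n_states, n_actions, sample_next_state, cost, gamma, n_samples):
    """One sweep of empirical Q-value iteration: the expectation over next
    states is replaced by an average over simulated samples from the
    unknown transition kernel."""
    Q_new = np.zeros_like(Q)
    for s in range(n_states):
        for a in range(n_actions):
            # Draw next-state samples via a simulator (hypothetical helper).
            samples = [sample_next_state(s, a) for _ in range(n_samples)]
            # Discounted-cost MDP: minimise the continuation value over actions.
            Q_new[s, a] = cost(s, a) + gamma * np.mean([Q[s2].min() for s2 in samples])
    return Q_new
```

Starting from an arbitrary Q and iterating `Q = eqvi_sweep(Q, ...)` mirrors exact Q-value iteration with sample averages in place of expectations; note that, consistent with the abstract, no stochastic-approximation step sizes appear in this sketch.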
Similar resources
CFQI: Fitted Q-Iteration with Complex Returns
Fitted Q-Iteration (FQI) is a popular approximate value iteration (AVI) approach that makes effective use of off-policy data. FQI uses a 1-step return value update which does not exploit the sequential nature of trajectory data. Complex returns (weighted averages of the n-step returns) use trajectory data more effectively, but have not been used in an AVI context because of off-policy bias. In ...
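As a small illustration of the “complex return” idea mentioned in this snippet, the sketch below (not from the paper) forms a weighted average of n-step returns along one trajectory segment; the variable names and the choice of weights (e.g., lambda-return-style weights) are assumptions for illustration only.

```python
import numpy as np

def complex_return(rewards, bootstrap_values, gamma, weights):
    """Complex return: a weighted average of the n-step returns.

    rewards:          r_0, ..., r_{T-1} observed along the segment
    bootstrap_values: V(s_1), ..., V(s_T) estimates used to bootstrap
    weights:          w_1, ..., w_T, assumed non-negative and summing to 1
    """
    n_step_returns = []
    discounted_rewards = 0.0
    for n in range(1, len(rewards) + 1):
        discounted_rewards += gamma ** (n - 1) * rewards[n - 1]
        # n-step return: discounted rewards plus a bootstrapped tail value.
        n_step_returns.append(discounted_rewards + gamma ** n * bootstrap_values[n - 1])
    return float(np.dot(weights, n_step_returns))
```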
A Tutorial on Linear Function Approximators for Dynamic Programming and Reinforcement Learning
A Markov Decision Process (MDP) is a natural framework for formulating sequential decision-making problems under uncertainty. In recent years, researchers have greatly advanced algorithms for learning and acting in MDPs. This article reviews such algorithms, beginning with well-known dynamic programming methods for solving MDPs such as policy iteration and value iteration, then describes approx...
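For reference, the dynamic-programming baseline this tutorial starts from fits in a few lines; the following is a minimal, illustrative sketch of tabular value iteration for a known finite MDP (the array shapes and stopping tolerance are assumptions, not taken from the article).

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Tabular value iteration for a finite MDP with known dynamics.

    P[s, a, s'] : transition probabilities, shape (S, A, S)
    R[s, a]     : one-step rewards, shape (S, A)
    """
    V = np.zeros(P.shape[0])
    while True:
        # Bellman optimality backup for every state-action pair.
        Q = R + gamma * (P @ V)        # shape (S, A)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new
```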
Exploiting Multi-step Sample Trajectories for Approximate Value Iteration
Approximate value iteration methods for reinforcement learning (RL) generalize experience from limited samples across large state-action spaces. The function approximators used in such methods typically introduce errors in value estimation which can harm the quality of the learned value functions. We present a new batch-mode, off-policy, approximate value iteration algorithm called Trajectory Fi...
Difference of Convex Functions Programming for Reinforcement Learning
Large Markov Decision Processes are usually solved using Approximate Dynamic Programming methods such as Approximate Value Iteration or Approximate Policy Iteration. The main contribution of this paper is to show that, alternatively, the optimal state-action value function can be estimated using Difference of Convex functions (DC) Programming. To do so, we study the minimization of a norm of th...
An Empirical Investigation of the Relation between Corporate Sustainability Performance (CSP) and Corporate Value: Evidence from Iran
This study provides an empirical evidence on how Corporate Sustainability Performance (CSP), is reflected in the corporate value. Using a theoretical framework combining Legitimacy theory, Stakeholder theory and Agency theory, a set of hypotheses that relate the corporate value to CSP is examined. For a sample of Iranian firms, 28 components with four dimensions as Community, Environment, Emplo...
Journal: CoRR
Volume: abs/1412.0180
Pages: -
Publication date: 2014